Efficient Web Data Mining with Standard XML Technologies
Abstract
We discuss the problem of Web data extraction and an XML-based methodology whose goal extends far beyond simple “screen scraping.” An ideal data extraction process is able to digest target Web databases that are visible only as HTML pages, and to create a local, identical replica of those databases as a result. This process requires much more than a Web crawler and a set of Web site wrappers. A comprehensive data extraction process needs to deal with roadblocks such as session identifiers, HTML forms, and client-side JavaScript, and with data integration problems such as incompatible datasets and vocabularies and missing or conflicting data. Proper data extraction also requires a solid data validation and error recovery service to handle data extraction failures, which are unavoidable. In this paper we describe NDES, a software framework that makes significant advances in solving these problems and provides a platform for building a production-quality Web data extraction process. Key aspects of NDES are that it uses XML technologies for data extraction, including XHTML and XSLT, and that it provides access to the “deep Web.”
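As a rough illustration of the general idea of extracting records from an XHTML page with XML tooling and routing bad rows to a validation step, the following minimal sketch uses Python's standard `xml.etree.ElementTree` with an XPath-style query. It is not the NDES implementation, which the abstract does not detail, and it uses XPath queries rather than XSLT because the Python standard library has no XSLT processor. The page fragment, the `products` table id, and the `extract_records` function are all hypothetical names invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by well-formed XHTML documents.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical page fragment, invented purely for illustration.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table id="products">
      <tr><th>Name</th><th>Price</th></tr>
      <tr><td>Widget</td><td>9.99</td></tr>
      <tr><td>Gadget</td><td></td></tr>
    </table>
  </body>
</html>"""


def extract_records(xhtml_text):
    """Return (records, errors) extracted from the hypothetical products table."""
    root = ET.fromstring(xhtml_text)
    records, errors = [], []
    rows = root.findall(".//x:table[@id='products']/x:tr", XHTML_NS)
    for row in rows:
        cells = [(td.text or "").strip() for td in row.findall("x:td", XHTML_NS)]
        if not cells:
            continue  # header row: contains only <th> cells
        # Minimal validation: exactly two non-empty cells are expected.
        if len(cells) != 2 or not all(cells):
            errors.append(cells)  # set aside for a later error-recovery step
            continue
        records.append({"name": cells[0], "price": float(cells[1])})
    return records, errors


records, errors = extract_records(PAGE)
```

Rows that fail validation are collected rather than dropped, which mirrors the abstract's point that extraction failures are unavoidable and need an explicit recovery path.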
Similar resources
An Improved Web Mining Technique to Fetch Web Data Using Apriori and Decision Tree
The World Wide Web is the largest source of information. Most of the data on the web is dynamic and unstructured, and it is becoming difficult to get the relevant data from the web. Data mining is the field of computer science used to extract knowledge from very large amounts of data. Web mining is the application of data mining, which implements various techniques of data mining to ...
Partitions musicales et technologies web
This paper shows that new web technologies such as SVG, DOM, AJAX, and CSS are now mature enough to allow browsing of musical scores with optimal graphical and ergonomic quality, together with powerful standard XML data-mining tools. KEYWORDS: AJAX, DOM, DTD, CSS, musical scores, MusicXML, SAX, SVG, web.
A Framework For Extracting Information From Web Using VTD-XML's XPath
The exponential growth of the WWW (World Wide Web) has produced a vast pool of information as well as several challenges, such as extracting potentially useful and unknown information from it. Many websites are built with HTML, and because of its unstructured layout it is difficult to obtain effective and precise data from the web using HTML. The advent of XML (Extensible Markup Language) p...
The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme
This paper describes the rationale and design of an XML-TEI encoded corpora compatible analysis platform for text mining called TXM. The design of this platform is based on a synthesis of the best available algorithms in existing textometry software. It also relies on identifying the most relevant open-source technologies for processing textual resources encoded in XML and Unicode, for efficien...